The corresponding journal entry for this assignment is available on GitHub at this link: https://github.com/bcb420-2022/Jedid_Ahn/wiki/Entry-%239:-Notes-from-lecture-for-A2

# Libraries to install here.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

if (!requireNamespace("GEOquery", quietly = TRUE)) {
  BiocManager::install("GEOquery")
}

if (!requireNamespace("edgeR", quietly = TRUE)) {
  BiocManager::install("edgeR")
}

if (!requireNamespace("limma", quietly = TRUE)) {
  BiocManager::install("limma")
}

if (!requireNamespace("circlize", quietly = TRUE)) {
  install.packages("circlize")
}

if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  BiocManager::install("ComplexHeatmap")
}

if (!requireNamespace("gprofiler2", quietly = TRUE)) {
  install.packages("gprofiler2")
}

# Libraries to load here.
library(Biobase)
library(GEOquery)
library(edgeR)
library(limma)
library(circlize)
library(ComplexHeatmap)
library(gprofiler2)

# Figure count: Start at 1.
count <- 1



Introduction

GEO dataset

The GEO dataset that I chose was GSE66261, which examines the expression and functions of long noncoding RNAs during human T helper cell differentiation.

GEO <- "GSE66261"
count_folder_path <- paste0(getwd(), "/", GEO)

if (!dir.exists(count_folder_path)){
  count_file_path <- rownames(GEOquery::getGEOSuppFiles(GEO))[1]
} else{
  count_file_path <- list.files(path = count_folder_path, full.names = TRUE)[1]
}

GSE66261 <- read.delim(count_file_path, header = TRUE, check.names = FALSE)

To better understand the data, groups were defined by formatting according to library, cell type, and condition. There are 2 libraries: 2695 and 2960, 2 cell types: Primary and effector, and 3 conditions: TH1, TH2, and TH17.

listed_titles <- colnames(GSE66261)[3:14]
gsm_titles <- c("TH1_Primary_2695", "TH2_Primary_2695", "TH17_Primary_2695", "TH1_Effector_2695", "TH2_Effector_2695", "TH17_Effector_2695", "TH1_Primary_2960", "TH2_Primary_2960", "TH17_Primary_2960", "TH1_Effector_2960", "TH2_Effector_2960", "TH17_Effector_2960")
  
exp_names <- data.frame(listed_titles, gsm_titles)

samples <- data.frame(lapply(exp_names$gsm_titles, 
                             function(x) { rev(unlist(strsplit(x, split = "_"))) }))
colnames(samples) <- listed_titles
rownames(samples) <- c("library", "cell_type", "condition")
samples <- data.frame(t(samples))


Filtering

Before normalization, the dataset was filtered by removing genes with low counts. Using edgeR protocol, features without at least 1 read per million in n of the samples were removed. Since there are 6 samples of each group, n = 6.

n <- 6

# Translate out counts into counts per million using the edgeR package.
cpms = edgeR::cpm(GSE66261[, 3:14])
rownames(cpms) <- GSE66261[, 1]

# Get rid of low counts
keep = rowSums(cpms > 1) >= n
GSE66261_filtered = GSE66261[keep, ]
rownames(GSE66261) <- 1:nrow(GSE66261)


HUGO gene symbols

The original dataset already came with HUGO gene symbols, so no mapping of symbols was required. In addition, tests showed that nearly 90% of symbols lined up with the symbols obtained from biomaRt.

Tests showed that there were 18 rows in total that consists of mappings to the same symbol. Two rows were updated by replacing their original HUGO gene symbol with the mapped gene symbol, while the rest were dropped to avoid compromising normalization and downstream analysis.

any(GSE66261_filtered$Name == "RNY1P13")
## [1] FALSE
GSE66261_filtered$Name[which(GSE66261_filtered$Feature == "ENSG00000201900")] <- "RNY1P13"

any(GSE66261_filtered$Name == "STRA6LP")
## [1] FALSE
GSE66261_filtered$Name[which(GSE66261_filtered$Feature == "ENSG00000254876")] <- "STRA6LP"

remove = duplicated(GSE66261_filtered$Name)
GSE66261_filtered <- GSE66261_filtered[!remove, ]


Normalization

Trimmed means of M-value (TMM) was chosen as the normalization technique as it a specialized normalization technique for RNASeq datasets.

filtered_data_matrix <- as.matrix(GSE66261_filtered[, 3:14])
rownames(filtered_data_matrix) <- GSE66261_filtered$Name
d = edgeR::DGEList(counts = filtered_data_matrix, group = samples$cell_type)

d = edgeR::calcNormFactors(d)
normalized_counts <- as.data.frame(edgeR::cpm(d))
rownames(normalized_counts) <- GSE66261_filtered$Name


Through the MDS plot as shown in Figure 1, we can see that groupings are favourable as it is clustering according to condition rather than library or cell type.
Figure 1: We observe that TH1 and TH2 samples cluster all together as they are both involved in immune response, while TH17 samples cluster together separately from TH1 and TH2.

Figure 1: We observe that TH1 and TH2 samples cluster all together as they are both involved in immune response, while TH17 samples cluster together separately from TH1 and TH2.


By observing the boxplot as shown in Figure 2, we can visualize the distribution of the data.
Figure 2: The boxplot shows that there aren't significant differences in the mean for each sample after normalization.

Figure 2: The boxplot shows that there aren’t significant differences in the mean for each sample after normalization.


On the other hand, the density plot as shown in Figure 3 shows that the data follows a soft bimodal distribution.
Figure 3: The soft bimodal distribution is something to keep in mind when performing differential expression analysis.

Figure 3: The soft bimodal distribution is something to keep in mind when performing differential expression analysis.


Finally, we calculated dispersion as shown in Figure 4 to determine how much variance deviates from the mean.
Figure 4: The BCV plot confirms the trend of more gene expression leading to dispersion values being closer together, and less variation overall.

Figure 4: The BCV plot confirms the trend of more gene expression leading to dispersion values being closer together, and less variation overall.


This is our final dataset of normalized counts, which will be used for differential expression analysis.

normalized_counts



Differential expression analysis

Linear model

Before running differential expression analysis, we have to create a linear model in R by creating a design matrix. We will account for both library variability and the condition of interest, which is contributing to the differential expression. This is demonstrated by the MDS plot shown in Figure 1, where samples are clustering according to condition.

model_design <- model.matrix(~ samples$library + samples$condition)

Fit the normalized counts to the model design created.

expression_mat <- as.matrix(normalized_counts)
minimal_set <- Biobase::ExpressionSet(assayData = expression_mat)
fit <- limma::lmFit(minimal_set, model_design)


Computation and multiple hypothesis testing

Then, we use empirical Bayes to compute differential expression.

fit2 <- limma::eBayes(fit, trend = TRUE)

(Answer to Q2) The Benjamini Hochberg method was used as the paper associated with the experiment specified that false discovery rate was used to correct for multiple testing.

output_hits <- limma::topTable(fit2, coef = ncol(model_design), adjust.method = "BH", number = nrow(expression_mat))

# Then, we sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]

(Answer to Q1) 2376 genes are differentially expressed pass the threshold p-value of 0.05. 0.05 was chosen as it is the most commonly used cutoff for significance, which signifies a 95% chance that differential expression would not be observed if the condition had no effect.

length(which(output_hits$P.Value < 0.05))
## [1] 2376

(Answer to Q2) On the other hand, 504 genes pass correction.

length(which(output_hits$adj.P.Val < 0.05))
## [1] 504


MA plots

(Answer to Q3) We can visualize the amount of differentially expressed genes using an MA plot as shown in Figure 5. limma’s plotMA function was used. Based on the plot, we can see that nearly all differentially expressed genes have an expression log-ratio and average log-ratio close to 0.
Figure 5: Differentially expressed genes are coloured and bolded in red.

Figure 5: Differentially expressed genes are coloured and bolded in red.


(Answer to Q3) As shown in Figure 6, we will now investigate RAD50 in more detail as the lncRNA cluster in humans overlap the RAD50 gene according to the paper.
Figure 6: We can see that the MA plot shows a high degree of confidence in RAD50 that it is differentially expressed and has high regulation.

Figure 6: We can see that the MA plot shows a high degree of confidence in RAD50 that it is differentially expressed and has high regulation.


Heatmap

(Answer to Q4) Next, we will create a heatmap to visualize upregulation and downregulation of the top hits in the data according to a colour scale. Based on Figure 7, we see that the conditions are clustered together due to the following groupings:

-TA-4 = TH1 and -TA-1 = TH1
-TA-2 = TH2 and -TA-5 = TH2
-TA-6 = TH17 and -TA-3 = TH17

Figure 7: Not only are the conditions clustered together, but their genes show similar patterns of upregulation and downregulation.

Figure 7: Not only are the conditions clustered together, but their genes show similar patterns of upregulation and downregulation.



Thresholded over-representation analysis

(Answer to Q1) Since we are focused on threshold calculation rather than rank calculation, options included DAVID, EnrichR, and g:Profiler. g:Profiler was used due to familiarity and because it is a popular tool for functional enrichment analysis and visualization of gene lists. g:Profiler was used through the gprofiler2 R package rather than through their web browser.


Create thresholded list of genes

I will start by creating a thresholded list of genes, by calculating their rank and then retrieving upregulated and downregulated genes that are significant (pval < 0.05).

output_hits[, "rank"] <- -log(output_hits$P.Value, base = 10) * sign(output_hits$logFC)
output_hits <- output_hits[order(output_hits$rank), ]

upregulated_genes <- rownames(output_hits)[which(output_hits$P.Value < 0.05 & 
                                                   output_hits$logFC > 0)]
downregulated_genes <- rownames(output_hits)[which(output_hits$P.Value < 0.05 & 
                                                     output_hits$logFC < 0)]
all_genes <- rownames(output_hits)

To present the results, I will store them in tables.

dir.create(paste0(getwd(), "/data"))

write.table(x = upregulated_genes,
            file = file.path("data", "GSE66261_upregulated_genes.txt"), sep = "\t",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

write.table(x = downregulated_genes,
            file=file.path("data","GSE66261_downregulated_genes.txt"), sep = "\t",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

write.table(x = data.frame(genename = rownames(output_hits), F_stat = output_hits$rank),
            file = file.path("data", "GSE66261_ranked_genelist.txt"), sep = "\t",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

There are 1035 upregulated genes that are significant.

length(upregulated_genes)
## [1] 1035

On the other hand, there are 1341 upregulated genes that are significant.

length(downregulated_genes)
## [1] 1341


Gene set enrichment analysis using g:Profiler

(Answer to Q2) To maintain consistency with the differential expression analysis, Benjamini Hochberg FDR was used as the correction method for multiple hypothesis testing. Unfortunately, the corresponding paper for GSE66261 did not specify which annotation sources that they used for gene set enrichment.

GO:BP was used as gene ontology is a popular source for identifying relevant biological processes. In addition, KEGG, Reactome (REAC), and Wiki Pathways (WP) were all included as parameters due to a specific interest in learning the biological pathways associated with the genes of interest. Version 0.2.1 of the R package gprofiler2 was used, which is presumed to use the same version as the web browser tool. The web g:Profiler version is e105_eg52_p16_e84549, which was last updated on 03/01/2022.

gost_res <- gprofiler2::gost(all_genes, organism = "hsapiens", 
                             correction_method = c("fdr"),
                             user_threshold = 0.05,
                             sources = c("GO:BP", "KEGG", "REAC", "WP"))

(Answer to Q3) A p-value threshold of 0.05 was used as previously done for differential expression analysis. 3862 genesets were returned in total, with nearly 75% (2861 out of 3862) coming from GO:BP.

nrow(gost_res$result)
## [1] 3862
table(gost_res$result$source)
## 
## GO:BP  KEGG  REAC    WP 
##  2861   152   658   191

(Answer to Q3) If the p-value threshold is set to a more stringent value of 0.01, 3026 genesets are returned instead.

gost_res_stringent <- gprofiler2::gost(all_genes, organism = "hsapiens", 
                                       correction_method = c("fdr"),
                                       user_threshold = 0.01,
                                       sources = c("GO:BP", "KEGG", "REAC", "WP"))

nrow(gost_res_stringent$result)
## [1] 3026
table(gost_res_stringent$result$source)
## 
## GO:BP  KEGG  REAC    WP 
##  2208   122   559   137
Observing the plot function that gprofiler2 provides as shown in Figure 8, the term names whose -log10(p-adj) value is greater than 16 are the most significant and the ones that we are most interested in.

Figure 8: Despite the visualization, it is hard to gauge the number of term names that we are most interested in in due to the capping of values whose -log10(p-adj) > 16.


Upregulated vs downregulated g:Profiler analysis

(Answer to Q4) Next, we will run the g:Profiler analysis against the upregulated and downregulated genes both, with a threshold of 0.05.

gost_res_up <- gprofiler2::gost(upregulated_genes, organism = "hsapiens", 
                                correction_method = c("fdr"),
                                user_threshold = 0.05,
                                sources = c("GO:BP", "KEGG", "REAC", "WP"))
gost_res_down <- gprofiler2::gost(downregulated_genes, organism = "hsapiens", 
                                  correction_method = c("fdr"),
                                  user_threshold = 0.05,
                                  sources = c("GO:BP", "KEGG", "REAC", "WP"))


(Answer to Q4) Using all genes, we see that the term names that are most enriched are involved in metabolic and biosynthetic processes.

knitr::kable(gost_res$result$term_name[1:10])
x
cellular macromolecule metabolic process
organic substance biosynthetic process
organonitrogen compound metabolic process
biosynthetic process
cellular biosynthetic process
cellular protein metabolic process
regulation of cellular metabolic process
protein metabolic process
macromolecule biosynthetic process
regulation of primary metabolic process


(Answer to Q4) Using only upregulated genes that are significant, we see that the term names that are most enriched are still involved in metabolic and biosynthetic processes.

knitr::kable(gost_res_up$result$term_name[1:10])
x
regulation of cellular process
biological regulation
small molecule metabolic process
biosynthetic process
oxoacid metabolic process
organic acid metabolic process
organic substance biosynthetic process
carboxylic acid metabolic process
regulation of biological process
regulation of cellular metabolic process


(Answer to Q4) On the other hand, enrichment of downregulated genes that are significant show term names that are involved in immune response and regulation as well as response to stimuli. Based on the results using all genes, this shows that upregulated genes dominate the landscape by contributing more to differential expression and downstream pathways.

knitr::kable(gost_res_down$result$term_name[1:10])
x
immune system process
regulation of response to stimulus
response to stimulus
regulation of cellular process
positive regulation of biological process
immune response
cellular response to stimulus
biological process involved in interspecies interaction between organisms
intracellular signal transduction
response to stress



Interpretation

  1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

Answer: Yes, they do. The original paper states that long non-coding RNAs (lncRNAs) play essential roles in arrays of cellular processes, which is consistent with the over-representation results of all genes and upregulated genes. The paper also states that differentiation of T helper cells is a critical step to adaptive immune response to pathogens, which is consistent with the over-representation results of downregulated genes.


  1. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

Answer: Yes, the paper “Long Non-coding RNAs: Major Regulators of Cell Stress in Cancer” by Connerty et al. explains that lncRNAs have diverse roles in cellular processes such as acting as epigenetic modulators and promoting or inhibiting transcription. On the other hand, the textbook “Molecular Biology of the Cell. 4th edition.” by Alberts et al. states that helper T cells are required for almost all adaptive immune responses.



References

Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., & Walter, P. (2002). Molecular biology of the cell. 4th edition. Garland Science.
Connerty, P., Lock, R. B., & Bock, C. E. de. (2020). Long non-coding RNAs: Major regulators of cell stress in cancer. Frontiers in Oncology, 10. https://doi.org/10.3389/fonc.2020.00285
Davis, S., & Meltzer, P. (2007). GEOquery: A bridge between the gene expression omnibus (GEO) and BioConductor. Bioinformatics (Oxford, England), 14, 1846–1847. https://doi.org/10.1093/bioinformatics/btm254
Gu, Z., Eils, R., & Schlesner, M. (2016). Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics (Oxford, England), 32(18), 2847–2849. https://doi.org/10.1093/bioinformatics/btw313
Gu, Z., Gu, L., Eils, R., Schlesner, M., & Brors, B. (2014). Circlize implements and enhances circular visualization in r. Bioinformatics (Oxford, England), 30(19), 2811–2812. https://doi.org/10.1093/bioinformatics/btu393
"Isserlin, R. (2022a). "BCB420: Lecture 6 - differential expression".
"Isserlin, R. (2022b). "BCB420: Lecture 7 - annotation dataset and intro to pathway analysis".
Morgan, M. (2021). BiocManager: Access the bioconductor project package repository. https://CRAN.R-project.org/package=BiocManager
Raudvere, U., Kolberg, L., Kuzmin, I., Arak, T., Adler, P., Peterson, H., & Vilo, J. (2019). g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research, 47(W1), W191–W198. https://doi.org/10.1093/nar/gkz369
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), 47. https://doi.org/10.1093/nar/gkv007
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England), 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616
Spurlock III, C. F., Tossberg, J. T., Guo, Y., Collier, S. P., Crooke III, P. S., & Aune, T. M. (2015). Expression and functions of long noncoding RNAs during human t helper cell differentiation. Nature Communications, 6(6932). https://doi.org/10.1038/ncomms7932